1. OVERVIEW


In the network security space, the negotiation between “security and convenience” is often thought of as the balancing 
act between security function and network performance. Every security function, even a simple port number check or 
IP address comparison, will introduce some performance impacts commonly manifested by the following: 

• Latency

• Throughput

• Packet drops

Among all the network security functions, flow-based deep packet inspection (DPI), or flow inspection for short, is 
one of the most expensive. A genuine flow inspection system must check and compare every byte traversing a 
network. Consequently, flow inspection systems are the most susceptible to network performance impacts. 


To mitigate network performance impacts, network security vendors that support flow inspection have devised 
different techniques to balance security effectiveness against performance. Note that if a network security 
system does not check or compare any bytes traversing the network, it will not incur any network performance 
impact, and it can achieve performance characteristics comparable to those of network bridges, switches, 
and routers. 


The flow inspection system that Trend Micro uses, the TippingPoint™ Threat Prevention System (TPS), must also 
balance security effectiveness against performance impacts. This white paper outlines the techniques that Trend 
Micro has developed and adopted in the TippingPoint TPS product line and provides guidelines for selecting the 
right product model to meet customers' network performance needs. 
2. TPS PERFORMANCE ARCHITECTURE 


Trend Micro has taken a three-pronged approach in its efforts to balance security effectiveness and 
performance impacts. 

1. Divide and conquer. 

2. Hierarchy. 

3. Hardware acceleration. 


2.1. Divide and Conquer Technique 


Most network security vendors mitigate the performance impact of executing security functions by distributing 
those functions across multiple processing units, so that the security functions required by different flows run 
in parallel. Each processing unit completes all the security functions for a flow before taking on another flow. 
This practice is commonly known as run-to-completion. 


Trend Micro takes a divide-and-conquer approach. We divide supported security functions into different and 
independent working pieces. We then distribute those working pieces among available processing units. Each 
processing unit only performs a portion of security functions. To complete the required security functions, a 
flow must be scheduled on multiple processing units responsible for different portions of security functions. 
Such a practice is commonly referred to as pipelining. 
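
The difference between the two scheduling models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three stage names are hypothetical placeholders rather than actual TPS security functions, and the loops record which unit would perform which piece of work instead of running anything in parallel.

```python
from typing import List

# Hypothetical security functions, listed in the order a flow needs them.
STAGES = ["header_check", "state_tracking", "payload_match"]


def run_to_completion(flow_ids: List[str], units: int) -> List[List[str]]:
    """Each unit executes every stage for a flow before taking another flow."""
    work: List[List[str]] = [[] for _ in range(units)]
    for i, flow in enumerate(flow_ids):
        unit = i % units  # flows are spread across the available units
        for stage in STAGES:
            work[unit].append(f"{flow}:{stage}")
    return work


def pipelined(flow_ids: List[str]) -> List[List[str]]:
    """Unit k executes only stage k; every flow visits the units in order."""
    work: List[List[str]] = [[] for _ in STAGES]
    for flow in flow_ids:
        for unit, stage in enumerate(STAGES):
            work[unit].append(f"{flow}:{stage}")
    return work


if __name__ == "__main__":
    plans = (
        ("run-to-completion", run_to_completion(["A", "B", "C"], units=3)),
        ("pipelined", pipelined(["A", "B", "C"])),
    )
    for name, plan in plans:
        print(name)
        for unit, items in enumerate(plan):
            print(f"  unit {unit}: {items}")
```

In the run-to-completion plan each unit's work list contains all stages of one flow, whereas in the pipelined plan each unit's work list contains a single stage applied to every flow.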


2.2. Hierarchy Technique 


In addition to dividing security functions into different working pieces, Trend Micro takes a hierarchical approach 
to divide the security functions. The hierarchy is defined following these criteria: 

• Tasks that can be completed without additional processing rank higher in the hierarchy. 

• Tasks that cannot be parallelized, such as tasks that require a state machine, are lower in the hierarchy. 

• Expensive tasks also rank lower in the hierarchy. 


For example, tasks that check packet integrity and that don't require any state machine are higher in the 
hierarchy. Because tasks that check packet headers are less expensive than tasks that check packet payloads, 
they would also be higher in the hierarchy. Note that the tasks in these examples also do not require additional 
processing if their results are determined to be invalid. This means that the affected packets do not need to be 
scheduled to other processing units responsible for lower hierarchy tasks. 


This hierarchy technique of dividing security functions into the right portions is one of the most innovative 
features of the TPS. The hierarchy must be defined not only to meet the above criteria but also to keep the 
divided security functions in the right order and sequence. Its far-reaching implications on the system extend to 
algorithms used to complete those hierarchical tasks. 
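
The ranking criteria above can be sketched as a sort key followed by an early exit. This is a minimal illustration, not the TPS scheduler: the task names, relative costs, and packet fields are all hypothetical, and ordering stateless tasks before stateful ones and cheaper tasks before expensive ones is one simple way to express the criteria.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    cost: int                       # hypothetical relative compute cost
    needs_state: bool               # True if the task requires a state machine
    check: Callable[[dict], bool]   # returns False when the packet is invalid


# Hypothetical tasks, deliberately listed out of order.
TASKS = [
    Task("payload_scan", 10, True, lambda p: b"attack" not in p["payload"]),
    Task("header_check", 2, False, lambda p: p["dst_port"] != 0),
    Task("integrity_check", 1, False, lambda p: p["checksum_ok"]),
]


def order_tasks(tasks: List[Task]) -> List[Task]:
    """Stateless tasks first, then by ascending cost."""
    return sorted(tasks, key=lambda t: (t.needs_state, t.cost))


def inspect(packet: dict, tasks: List[Task]) -> Tuple[str, List[str]]:
    """Run tasks in hierarchy order and stop at the first invalid result."""
    executed = []
    for task in order_tasks(tasks):
        executed.append(task.name)
        if not task.check(packet):
            return "invalid", executed  # no lower-hierarchy task is scheduled
    return "passed", executed


if __name__ == "__main__":
    corrupt = {"checksum_ok": False, "dst_port": 80, "payload": b""}
    normal = {"checksum_ok": True, "dst_port": 80, "payload": b"hello"}
    print(inspect(corrupt, TASKS))
    print(inspect(normal, TASKS))
```

A packet that fails the cheapest check never reaches the expensive payload scan, which is the saving the hierarchy is designed to produce.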
2.2.1. Security Pattern Hierarchy 


The hierarchical approach is not only applied to security functions. It is also applied to security patterns (such as 
rules or signatures). TPS security patterns, which are captured in digital vaccines (DV), are defined hierarchically: 
• Higher hierarchy subpatterns are less expensive to match. 
• Higher hierarchy subpatterns also establish the necessary condition for the lower hierarchy subpatterns. 


In other words, matching higher hierarchy subpatterns is faster than matching lower hierarchy subpatterns, and if 
the higher hierarchy subpatterns are not matched in flow packets, then there is no need to perform the matching 
of lower hierarchy subpatterns. 


This is why our DVs contain filters. The higher hierarchy subpatterns provide coarse, high-level filtering, while the 
lower hierarchy subpatterns provide more refined filtering. Only packets that are caught by the high-level 
filters need to be processed further by the more refined filters. 
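
A two-level version of this idea can be sketched as a cheap substring test that gates a costlier regular expression. The filter names, byte strings, and patterns below are invented for illustration and do not reflect the actual DV format; the point is only that the refined subpattern is evaluated solely when the coarse subpattern, its necessary condition, has matched.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Filter:
    name: str
    coarse: bytes          # cheap substring, a necessary condition for a match
    refined: re.Pattern    # costlier pattern, checked only if coarse matched


# Hypothetical filters, for illustration only.
FILTERS = [
    Filter("example-traversal", b"../", re.compile(rb"GET\s+\S*(\.\./){3,}")),
    Filter("example-shell", b"cmd.exe", re.compile(rb"cmd\.exe\s*/c")),
]


def match(filters: List[Filter], payload: bytes) -> Tuple[List[str], int]:
    """Return the names of matching filters and the number of refined checks run."""
    hits = []
    refined_checks = 0
    for f in filters:
        if f.coarse not in payload:
            continue  # coarse subpattern absent: skip the expensive check
        refined_checks += 1
        if f.refined.search(payload):
            hits.append(f.name)
    return hits, refined_checks


if __name__ == "__main__":
    for payload in (b"GET /index.html HTTP/1.1",
                    b"GET /a/../b HTTP/1.1",
                    b"GET /../../../etc/passwd HTTP/1.1"):
        print(payload, match(FILTERS, payload))
```

Payloads that contain none of the coarse subpatterns cost only substring tests, which mirrors how "clean" traffic avoids the refined filters.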


2.3. Hardware Acceleration 


Even with these approaches and techniques, general-purpose computing chips such as CPUs or NPUs (network 
processing units) cannot complete those tasks once the volume of network traffic reaches a certain 
level, typically around 5 to 10 Gbps. Every network security vendor relies on hardware acceleration techniques to 
achieve higher throughput. 


Trend Micro TippingPoint product lines use field-programmable gate arrays (FPGAs) to accelerate flow inspection. 
Compared with other hardware acceleration techniques, the FPGA is unique in its programmability. It is the only 
hardware solution capable of delivering software-defined hardware acceleration. In addition, the FPGA enables us to 
continuously add new and different hardware acceleration functions to meet different needs of flow inspection. 


2.4. SSL Inspection Performance Architecture 


Inspecting SSL flows incurs additional performance impacts on TPS appliances. To inspect an SSL connection, 
which consists of a pair of unidirectional flows, the TPS must perform these tasks: 

• Establish a TCP proxy connection. 

• Establish an SSL/TLS proxy session. 

• Decrypt SSL/TLS proxy session traffic and re-encrypt the traffic after its inspection. 


These SSL-specific tasks are expensive, particularly when they are performed by general-purpose CPUs 
or NPUs. Not only do those tasks require a state machine, they are also compute-intensive because of the required 
crypto functions. 


We take a similar divide-and-conquer approach to split these SSL-specific functions into different working 
pieces so that they can be pipelined into different processing units. However, we divide them logically instead of 
hierarchically, following the sequence of SSL/TLS proxy operations. 


We also apply hardware acceleration techniques to some of the compute-intensive tasks, specifically 
crypto-related functions. Naturally, the same divide-and-conquer and hierarchy approaches are taken for security 
functions that are applicable to SSL traffic, such as SSL payload inspection. 


2.5. TPS System Architecture 


The following diagram shows the overall TPS system architecture. It implements the three-pronged approach we 
have discussed to achieve an optimal balance between security effectiveness and performance impacts. 


Figure 1. TPS Performance Pipelining Architecture 


TPS systems consist of four layers of functional hierarchy distributed across multiple sets of CPU processing units, 
as well as an FPGA on those TPS systems that include FPGA hardware. As shown in Figure 1, each layer acts as a filter 
to screen out “clean” traffic (green arrowed lines) and send “dirty” traffic (yellow arrowed lines) to the next layer 
of filters. Only at the last layer of filters is traffic identified as malicious and are the security policies 
applied and enforced (red line with stop sign). 


Each layer of hierarchy can have a different set size of processing units. The set size is determined by the functional 
complexity and capacity of that layer in the hierarchy. The total number of processing units across all sets equals the 
hardware capacity. On TPS systems that do not have an FPGA (1100TX and 5500TX models), functions of all four layers of 
hierarchy are executed by four sets of CPU processing units. On TPS systems that do have an FPGA (8200TX and 8400TX), 
functions in the first layer of functional hierarchy are executed by the FPGA and the remaining ones are executed by 
three sets of CPU processing units. 


While we use a similar hierarchy to carry out SSL inspection, there is a notable difference between the inspection 
operations of SSL traffic and non-SSL traffic. For non-SSL traffic, the first layer of functional hierarchy is executed 
by FPGA when FPGA is available in the system. For SSL traffic, all the four layers of inspection functional hierarchy 
are executed by general purpose CPU processing units. The hardware acceleration approach only applies to SSL 
crypto functions. 
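
The layer placement described in the two preceding paragraphs can be summarized as a small lookup. The function below is a sketch for illustration only: the model names and the placement rules come from the text, while the function name and its signature are assumptions.

```python
# Models named in the text; the grouping follows the description above.
FPGA_MODELS = {"8200TX", "8400TX"}
CPU_ONLY_MODELS = {"1100TX", "5500TX"}


def layer_executor(model: str, layer: int, ssl: bool = False) -> str:
    """Return which hardware runs a given inspection layer (1 to 4)."""
    if model not in FPGA_MODELS | CPU_ONLY_MODELS:
        raise ValueError(f"unknown model: {model}")
    if not 1 <= layer <= 4:
        raise ValueError("layer must be between 1 and 4")
    # Only the first layer of non-SSL inspection is offloaded to the FPGA.
    if layer == 1 and model in FPGA_MODELS and not ssl:
        return "FPGA"
    return "CPU"


if __name__ == "__main__":
    for model in ("1100TX", "8200TX"):
        for ssl in (False, True):
            row = [layer_executor(model, layer, ssl) for layer in range(1, 5)]
            print(model, "SSL" if ssl else "non-SSL", row)
```

Note that this covers only the inspection layers; the hardware acceleration of SSL crypto functions mentioned above is outside the scope of the lookup.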


Similarly, the security patterns in the DV are defined by three layers of subfilter hierarchy. On TPS systems that 
have an FPGA, matching subfilters in the first layer of hierarchy is performed by the FPGA. Matching subfilters in the 
other layers of hierarchy is accomplished by the CPU. On TPS systems that do not have an FPGA, subfilter matching in all 
the hierarchical layers is executed by the CPU. 
3. TPS PERFORMANCE CHARACTERIZATION 


The TPS performance architecture discussed so far results in the following unique performance characteristics: 


1. Low latency of “clean” traffic. The “cleaner” the traffic is, the faster the traffic will be forwarded by the 
system. Traffic that is “clean” after the first layer of functional hierarchy will be forwarded by the TPS without 
being processed by the other three layers of functional hierarchy. Similarly, traffic that is “dirty” after the first 
layer of functional hierarchy but “clean” after the second layer of functional hierarchy will be forwarded by 
the TPS without being processed by the remaining two layers of functional hierarchy. 


On a TPS system that has FPGA, traffic that is “clean” (green line in Figure 1) after the first layer of functional 
hierarchy, which is executed by FPGA, will be forwarded by the TPS without requiring any CPU processing. 
Note that the typical FPGA processing latency is within the single digits of microseconds. 


2. “Dirtier” traffic does not clog “cleaner” traffic. Traffic that is “dirty” after one layer of functional hierarchy 
will not stay in the same layer for further processing. That traffic will be queued to different sets of 
processing units responsible for the subsequent layers of functional hierarchy. The processing units in the 
given layer therefore remain available to process incoming traffic, and “clean” traffic will be continuously 
forwarded by the TPS system. 


3. Only “dirty” traffic pays a latency penalty. The “dirtier” the traffic is, the higher the latency penalty will 
be. At any given layer of functional hierarchy, only “dirty” traffic is forwarded to the next layer of 
functional hierarchy for deeper inspection, which further delays its forwarding by the system. Note that 
traffic that is “dirty” after the last layer of functional hierarchy (i.e., traffic that is determined to be 
malicious) pays the ultimate latency penalty: it will never be forwarded by the system 
(red line with stop sign in Figure 1), resulting in infinite latency. 
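
The three characteristics above can be captured in a toy latency model, sketched below. The per-layer latencies are hypothetical values chosen only for illustration (the text states just that typical FPGA latency is in the single digits of microseconds), and queueing delay is ignored.

```python
import math
from typing import Optional

# Hypothetical per-layer processing latencies in microseconds, layer 1 first.
LAYER_LATENCY_US = [5.0, 50.0, 200.0, 1000.0]


def forwarding_latency_us(clean_after_layer: Optional[int]) -> float:
    """Latency for traffic judged clean after the given layer (1 to 4).

    None means the traffic is still dirty after the last layer, that is,
    malicious, so it is never forwarded.
    """
    if clean_after_layer is None:
        return math.inf
    if not 1 <= clean_after_layer <= len(LAYER_LATENCY_US):
        raise ValueError("layer out of range")
    # Traffic pays for every layer it traverses, and no others.
    return sum(LAYER_LATENCY_US[:clean_after_layer])


if __name__ == "__main__":
    for verdict in (1, 2, 3, 4, None):
        print(verdict, forwarding_latency_us(verdict))
```

Under these assumed numbers, traffic cleared by the first layer is forwarded after 5 microseconds, while traffic that reaches the last layer accumulates the latency of all four.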


3.1. Throughput Limitations 


The overall TPS system performance depends heavily on the nature and characteristics of network traffic. The 
following three are the most prominent traffic patterns that will affect TPS’ overall system performance: 

• Large volume of small-sized packets. 

• Bursts of packets of different sizes. 

• Large number of packets that need to be processed by multiple layers of functional hierarchy. 
3.1.1. Small Size Packets 


There will always be a fixed compute cost associated with every network packet, regardless of whether any 
security function is applied to it. For example, suppose “α” is the fixed compute cost for a packet. If the total 
compute budget available to a network system is “β” per second, then the total number of packets the system can 
process in a second is also fixed at “M = β/α”. “M” is commonly referred to as the system's maximum 
packet-per-second (PPS) rate. 


The throughput of the system is defined by “M × S”, where “S” is the packet size in bits. If a network carries a large 
volume of small packets, then the throughput of the system will be low. 


For example, if “M” is 1 Mpps and “S” is 512 bits (a packet size of 64 bytes), then the system throughput will be 512 
Mbps. If the majority of packets in a network are large, for example, if “S” is 12,144 bits (a packet 
size of 1,518 bytes), then the throughput of the same system becomes 12.144 Gbps. 


3.1.2. Packet Bursting 


Over a short period of time, such as a few seconds, network traffic is inherently unpredictable. User behaviors 
such as application launches, application use, and mouse clicks are very hard to predict over such a period, and 
so are application behaviors such as the number of network connections and the volume of network data. This 
unpredictability results in a networking phenomenon commonly known as packet bursting: within a short time span 
of seconds or milliseconds, a large number of packets are pumped into a network. 


As mentioned, every network system has a fixed maximum packet-per-second rate. If the incoming packet rate 
exceeds the system’s maximum PPS rate, then some of the incoming packets will be dropped, resulting in 
different types of application performance degradation, such as increased latency and decreased throughput 
caused by the retransmission of dropped packets. 
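
The effect of bursting on a system with a fixed packet rate can be illustrated with a small discrete-time simulation. This is a deliberately simplified sketch: the tick length, per-tick capacity, and buffer size are hypothetical, and retransmissions are not modeled.

```python
from typing import List, Tuple


def simulate(arrivals: List[int], capacity_per_tick: int,
             buffer_size: int) -> Tuple[int, int, int]:
    """Return (forwarded, dropped, still_queued) after processing all ticks."""
    queued = forwarded = dropped = 0
    for arriving in arrivals:
        # Accept packets while buffer space remains; the excess is dropped.
        accepted = min(arriving, buffer_size - queued)
        dropped += arriving - accepted
        queued += accepted
        # Serve up to the fixed per-tick capacity.
        served = min(queued, capacity_per_tick)
        queued -= served
        forwarded += served
    return forwarded, dropped, queued


if __name__ == "__main__":
    steady = [100] * 10          # 1,000 packets spread evenly over 10 ticks
    bursty = [1000] + [0] * 9    # the same 1,000 packets in a single tick
    print("steady:", simulate(steady, capacity_per_tick=100, buffer_size=200))
    print("bursty:", simulate(bursty, capacity_per_tick=100, buffer_size=200))
```

Both inputs carry the same average load, yet only the bursty one overflows the buffer, which is why average link statistics alone can understate the required capacity.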


Packet drops caused by packet bursting have bigger performance impacts on systems like the TPS that process 
network flows. Because of packet drops, the flow inspection system needs to wait for the retransmission 
of missing packets to complete its inspection. In the meantime, the system must buffer the packets it has already 
received so that it can assemble them into a logical flow before completing the execution of security functions. 
This adds a burden to the system’s compute resources, which are shared by all the supported security functions. 


3.1.3. “Dirty” Packets 


In a multilayered hierarchy function system like the TPS, each layer of hierarchy function can be considered a 
subsystem that has its own input and output queues, its own allocated processing units, and its own 
maximum packet-per-second rate. If the packet arrival rate to the subsystem exceeds its maximum PPS rate, then 
the subsystem will behave similarly to other network systems. It will manifest the three performance impacts 
discussed in section 1: increased latency, reduced throughput, and packet drops. 


From Figure 1, you can see that the packet arrival rate to the first layer of hierarchy function is determined by 
the underlying network, while the packet arrival rates to the other layers are determined by the volume of “dirty” 
packets from the preceding layer. If the volume of “dirty” packets reaching any layer of hierarchy function is higher 
than its PPS rate capacity, then the associated subsystem will show performance degradation. Because the subsystems 
are pipelined, a performance degradation at one subsystem will start a chain reaction at other subsystems. 
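
A rough way to reason about this is to propagate the "dirty" fraction from layer to layer and compare each layer's arrival rate with its capacity. The sketch below does exactly that; the dirty fractions and per-layer PPS capacities are hypothetical, and the chain reaction between subsystems is not modeled.

```python
from typing import List


def layer_arrival_rates(input_pps: float, dirty_fraction: List[float]) -> List[float]:
    """Arrival rate at each layer; dirty_fraction[i] is the share of layer i+1's
    input that is still dirty and is passed on to the next layer."""
    rates = [float(input_pps)]
    for fraction in dirty_fraction:
        rates.append(rates[-1] * fraction)
    return rates


def overloaded_layers(input_pps: float, dirty_fraction: List[float],
                      capacity_pps: List[float]) -> List[int]:
    """Return the 1-based layers whose arrival rate exceeds their PPS capacity."""
    rates = layer_arrival_rates(input_pps, dirty_fraction)
    return [i + 1 for i, (rate, cap) in enumerate(zip(rates, capacity_pps))
            if rate > cap]


if __name__ == "__main__":
    capacity = [2_000_000, 500_000, 100_000, 80_000]  # hypothetical, per layer
    print(overloaded_layers(1_000_000, [0.25, 0.5, 0.5], capacity))
    print(overloaded_layers(1_000_000, [0.125, 0.5, 0.5], capacity))
```

In the first call the third layer receives 125,000 packets per second against an assumed capacity of 100,000, so a link that is well within the first layer's capacity can still degrade at a deeper layer.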
4. SELECTION OF THE RIGHT TPS MODEL 


Independent of underlying technology and techniques, the performance of any network security system will be 
heavily affected by the nature of network applications and the behavior of network application users. Selecting the 
right model of a network security system to meet the right performance expectation can be laborious. 


While over-provisioning a network security system can increase costs unnecessarily, under-provisioning it 
can affect not only the application performance and user experience but also the security posture of the 
network and application. 


The first step, and perhaps one of the most commonly practiced steps in selecting a network security system, is 
to use vendor-provided performance datasheets of network security systems, in addition to network statistics 
available from networking gear such as switches and routers. Used on its own, this strategy often results in 
under-provisioning of network security systems. 


For an optimal strategy in selecting a network security system, customers should have a good understanding of: 
• Methodologies used to generate performance datasheets. 
• Networking characteristics of underlying network applications. 


4.1. Understanding Performance Datasheets 


For networking gear such as switches and routers, there is a well-understood and widely accepted methodology 
to measure performance: RFC 2544. This measures basic UDP packet forwarding rates (PPS) and the average 
packet forwarding latency with different sizes of packets. 


For network security systems, there is no well-defined or commonly adopted methodology to measure the 
overall system performance. Different vendors use different methodologies to produce their performance 
datasheets. If someone uses “vendor A's” methodology to measure “vendor B’s” system, they will get 
performance data different from those published in “vendor B’s” datasheet. 


As illustrated in section 3.1.1, by simply changing the size of testing packets used to measure the system throughput, 
you can get very different throughput values, from about 0.5 Gbps with small packets to more than 12 Gbps with large 
packets. Thus the throughput statistic on a performance datasheet can theoretically be off by a factor of more 
than 20, simply based on packet size. 


For TPS datasheets, the throughput number is measured by averaging the throughput of UDP packets of size 1,024 
bytes and the throughput of HTTP connections transferring a single web page of 21,000 bytes. Note that during 
the UDP packet throughput tests, all testing UDP packets are exactly the same size, about two-thirds of 
Ethernet’s standard maximum transmission unit (MTU). The size of testing TCP packets generated by HTTP 
connections varies, ranging from 20 bytes to 1,500 bytes. 


Performance datasheets for SSL inspection add another dimension to the testing parameters. Besides packet size, 
the performance of SSL inspection can vary considerably based on the type of cipher suite and cipher key size 
used in performance testing. 
TPS uses the following cipher suites and key sizes to measure its performance of SSL inspection: 
• ECDHE for key exchange to support perfect forward secrecy (PFS). 
• RSA with a 2048-bit key for authentication. 
• AES-GCM with a 256-bit key and SHA-2 with a 384-bit digest to support authenticated encryption with 
associated data (AEAD). 







Threat Prevention System Performance and Model Selection 




These ciphers were chosen not only because they are the most popular ciphers used over the Internet but also 
because they are more secure than other ciphers. Choosing lower-grade ciphers or removing support for 
more secure practices (such as PFS and AEAD) will produce better performance testing results. 


For example, compare AES encryption with a 256-bit key (AES-256) to AES encryption with a 128-bit key 
(AES-128). From the perspective of pure compute resources required for AES operations, AES-256 consumes 40% 
more compute resources than AES-128, which is consistent with its 14 rounds versus 10 for AES-128. The 
implication on a datasheet would be that AES-128 appears 40% faster than AES-256. 


4.2. Understanding Traffic Characteristics 


Gaining some insights into underlying network applications and their traffic characteristics can significantly help 
customers calculate a proper TPS capacity to support the inspection of network traffic. Customers should pay 
attention to the following three application traffic characteristics: 
• Proportion of small size packets to large size packets. 
• Proportion of short-lived TCP connections to long-lived TCP connections. 
• Proportion of small application transactions, such as voice, to large application transactions, such as video 
streams and large file transfers. 


As explained, if the majority of application traffic consists of small-sized packets, short-lived connections, 
and small application transactions, then a TPS of 40 Gbps capacity will not be able to achieve 40 Gbps 
throughput performance. 


Many networking devices provide statistics that can be used to approximate those characteristics. There are also 
network monitoring tools capable of capturing those application traffic characteristics. 


4.3. Understanding SSL Traffic 


Special attention should also be given to applications that protect their traffic with SSL. Besides knowing its 
proportion in the network, customers should know whether they need to decrypt all traffic for deep inspection or 
inspect traffic without decryption. If they do need to decrypt SSL traffic for deep inspection, then the system 
throughput available for inspecting non-encrypted traffic will decrease. 


4.4. Heuristic Rules 


Because of the unpredictability and complexity of network traffic characteristics, it is not practical to establish 
a well-defined formula to determine the inspection capacity and to select the ideal TPS model to support that 
capacity. There are, however, some heuristic rules customers can apply for better TPS capacity planning. 
1. For generic network traffic, if the networking stats are showing “X” amount of traffic passing through a 
network link, then select a TPS capacity of 1.3 times the value of “X”. 
2. For application traffic that would require multiple layers of functional hierarchy operations, select a TPS 
capacity with a multiplier of 2.5. 
3. For every gigabit per second of SSL traffic to be inspected, allocate a TPS capacity with a multiplier of 4. 


The first rule is intended to address traffic unpredictability, such as packet bursting. For example, if a switch or 
router interface shows that the peak throughput of a network link is 1.5 Gbps, then the required TPS inspection 
capacity for the link is about 2 Gbps (1.5 Gbps × 1.3 = 1.95 Gbps). 


The second rule is intended to address certain types of traffic that would result in large portions of “dirty” 
traffic across multiple layers of functional hierarchy. There is no straightforward way to apply the second rule. It 
requires knowledge of traffic type and statistics. It also requires understanding the TPS security policy used to 
inspect the traffic. TPS systems come with a default security policy or recommended profile. Several filters in the 
recommended profile are disabled by default to reduce the volume of potential “dirty” traffic across multiple 
layers of functional hierarchy. If a security policy needs to enable those filters for a given traffic type, then the 
second rule can be applied to the traffic of the specific type. For example, with SMB traffic, a number of 
SMB-specific subfilters will mark SMB traffic as “dirty” at multiple layers. Many of those filters are disabled in the 
default security policy. If the average amount of SMB traffic is 2 Gbps and the TPS user needs to enable those 
SMB-specific filters, then the required TPS inspection capacity for the SMB traffic is 5 Gbps (2 Gbps × 2.5). 


The third rule is intended to address the overhead of TCP proxy, SSL proxy, and decryption and re-encryption 
required by SSL inspection. For example, if a network link carries 1 Gbps SSL traffic and SSL inspection is needed 
for the SSL traffic, then the required TPS inspection capacity is 4 Gbps. 


When we combine all the preceding sample traffic and apply the corresponding rules, the total 4.5 Gbps of 
network traffic includes: 

• 1 Gbps SSL traffic. 

• 2 Gbps SMB traffic. 

• 1.5 Gbps generic non-SSL traffic. 

So the required total TPS inspection capacity would be 11 Gbps (4 Gbps + 5 Gbps + about 2 Gbps). 
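
The combined calculation can be reproduced with a short Python sketch. The multipliers are the three heuristic rules from this section; the function name and the traffic category labels are assumptions introduced for illustration.

```python
from typing import Dict

# Multipliers taken from the three heuristic rules in this section.
MULTIPLIERS = {
    "generic": 1.3,     # rule 1: generic traffic, to absorb unpredictability
    "multilayer": 2.5,  # rule 2: traffic inspected by multiple layers (e.g. SMB)
    "ssl": 4.0,         # rule 3: traffic that needs SSL inspection
}


def required_capacity_gbps(traffic_gbps: Dict[str, float]) -> float:
    """Sum each traffic category's volume multiplied by its rule multiplier."""
    return sum(MULTIPLIERS[kind] * gbps for kind, gbps in traffic_gbps.items())


if __name__ == "__main__":
    sample = {"ssl": 1.0, "multilayer": 2.0, "generic": 1.5}
    total = required_capacity_gbps(sample)
    print(f"required capacity: {total:.2f} Gbps")
```

For the sample mix the sum is roughly 10.95 Gbps, which the text rounds to 11 Gbps.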


Note that these rules are heuristic. They provide hints for initial planning of TPS capacity. For unknown and 
complex network traffic characteristics, adopting a pilot program might be a better approach. 


4.5. A Final Note 


As described in section 3, the TPS has several performance limitations. Some of those limitations are hard 
to assess if you take into account different network environments, traffic characteristics, and TPS filter 
configurations. Nevertheless, Trend Micro is aware of these limitations and their causes. We can and will continue 
to improve the TPS performance and reduce those limitations. 